Written by Savahnna L. Cunningham

Date: October 17, 2017

The Red Wine dataset is publicly available for research. The details are
described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine
preferences by data mining from physicochemical properties. In Decision Support
Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

Introduction

The goal of this analysis is to quantify and gain an understanding of how
chemical properties impact the quality rating of red wine. The dataset contains
1599 red wine samples with 11 variables, quantifying the physicochemical
properties of wine. The wine samples in this dataset are related to red variants
of the Portuguese “Vinho Verde” wine.

A multiple regression analysis will be conducted on the dataset to test how
changes in the 11 independent physicochemical properties predict changes in the
quality rating of a wine. The F-test will be used to determine which predictor
variables merit inclusion in the model.

The statistical hypotheses for this analysis are as follows:

H0 (Null Hypothesis): None of the 11 independent physicochemical properties has
a linear relationship with the dependent quality rating of a wine; that is,
every regression coefficient βj (j = 1, ..., 11) is zero, which can be
mathematically represented as

H0: β1 = β2 = ... = β11 = 0

H1 (Alternate Hypothesis): At least one of the 11 independent physicochemical
properties predicts the outcome of the dependent quality rating of a wine, which
can be mathematically represented as

H1: βj ≠ 0 for at least one j

Attribute information:

Input variables (based on physicochemical tests):

 1 - fixed acidity (tartaric acid - g / dm^3)
 
 2 - volatile acidity (acetic acid - g / dm^3)
 
 3 - citric acid (g / dm^3)
 
 4 - residual sugar (g / dm^3)
 
 5 - chlorides (sodium chloride - g / dm^3)
 
 6 - free sulfur dioxide (mg / dm^3)
 
 7 - total sulfur dioxide (mg / dm^3)
 
 8 - density (g / cm^3)
 
 9 - pH
 
 10 - sulphates (potassium sulphate - g / dm^3)
 
 11 - alcohol (% by volume)
 
 Output variable (based on sensory data): 
 
 12 - quality (score between 0 and 10)

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile
    (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high
    a level can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add 'freshness' 
    and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops;
    it's rare to find wines with less than 1 gram/liter, and wines with
    greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between 
    molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents 
    microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low
    concentrations, SO2 is mostly undetectable in wine, but at free SO2
    concentrations over 50 ppm, SO2 becomes evident in the nose and taste
    of wine.

8 - density: the density of wine is close to that of water depending on
    the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 
    (very acidic) to 14 (very basic); most wines are between 3-4 on the 
    pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide
     gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Methodology

Univariate Plots & Analysis

Summary statistics for the 13 columns in the data frame. The X1 column holds the
wine sample ID. The ‘quality’ variable is the dependent variable, a sensory
score reflecting how much the wine sample was liked.

##        X1         fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.43       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##  NA's   :2                                                             
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000  
## 

Creating a Categorical Variable

The quality variable is stored as an integer (‘int’), which is not well suited
to group comparisons. The first step will be to derive a factor from the quality
score and add it to the data frame as a new variable, quality.rating, with three
categories: good (>= 7), bad (<= 4), and mediocre (5 and 6).

# Gained inspiration for this code from the R-Bloggers website [6,7].

# Bin the integer score into three categories:
# good (>= 7), bad (<= 4) and mediocre (5 or 6).
wine$quality.rating <- ifelse(wine$quality >= 7, 'good',
                              ifelse(wine$quality <= 4, 'bad', 'mediocre'))

# Store the labels as a factor with levels ordered from worst to best
wine$quality.rating <- factor(wine$quality.rating,
                              levels = c("bad", "mediocre", "good"))
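As an alternative sketch, the same three categories can be produced in a single step with base R's cut(). The toy vector below is hypothetical and only stands in for wine$quality.

```r
# Hypothetical quality scores standing in for wine$quality
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf]
quality.rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("bad", "mediocre", "good"))

# Count of samples in each category
table(quality.rating)
```

Because the breaks are stated once, the category boundaries are easy to change later without touching several ifelse() calls.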

The visualizations represent the distribution of the dependent variable analyzed
in the dataset. The upper plot is a histogram of the raw wine quality score; most
of the wine samples have a score of 5 or 6. The raw quality data was transformed
into categorical data to make group comparisons easier. Scores of 4 or less were
labeled “bad”, scores of 5 or 6 were labeled “mediocre”, and scores of 7 or
higher were labeled “good”. The bottom visualization depicts the categorical
distribution of the quality score. The large majority of wine samples fall into
the mediocre category, with ~250 “good” samples in the dataset and “bad” wine
being the least common.

Normal Distribution Plots

The following independent variables have a normal or close-to-normal
distribution: fixed.acidity, volatile.acidity, density, pH and alcohol. The
remaining variable in this group, citric.acid, is the exception: it has a bimodal
distribution.

Transformations to near normality

The residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and
sulphates variables do not have a normal or close-to-normal distribution. The
variables have right-skewed distributions; therefore, the data will be
transformed to near normality using a logarithmic function [4,13].
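A minimal sketch of the effect of such a transformation is shown here, assuming a base-10 logarithm. The sample is simulated and only stands in for a right-skewed column such as residual.sugar; the skewness helper is a hypothetical name introduced for illustration. A log transform requires strictly positive values, which holds for the five variables listed above according to the summary table.

```r
set.seed(1)

# Hypothetical right-skewed sample standing in for a wine variable
x <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)

# Simple moment-based skewness (no extra packages needed)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive for right-skewed data
skew_log <- skewness(log10(x))  # much closer to zero after the transform

c(raw = skew_raw, log10 = skew_log)
```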

Synopsis

The red wine dataset contains 1599 red wine samples described by 11
physicochemical variables that may affect a wine’s perceived quality. Five of the
physicochemical variables had right-skewed distributions, and a logarithmic
transformation was used to better understand them. The main features of interest
are the 11 independent variables and how they correlate with a wine’s quality.


Bivariate Plots & Analysis

Density appears to have a small positive correlation with acids. Additionally,
pH has an inverse relationship with the acids, which is to be expected.

The following pairs of independent variables have a strong correlation
(|r| > 0.5, positive or negative):

  • free sulfur dioxide vs total sulfur dioxide
  • fixed acidity vs density
  • fixed acidity vs pH
  • fixed acidity vs citric acid

The exploratory data analysis will focus on the relationship between the
independent variables and the dependent quality rating variable.

Independent Variables vs. Quality Rating: A Positive Linear Relationship

The visualization indicates good wine contains a higher percentage of alcohol,
averaging ~11.5% by volume.

The visualization indicates good wine contains a higher concentration of
Tartaric Acid, averaging ~9 g/dm³ for good wine.

The visualization indicates good wine contains a higher quantity of citric acid.
The quality rating improves when a wine has a citric acid content between 0.30
and 0.50 g/dm³, with the average citric acid concentration being ~0.35 g/dm³.

The visualization indicates there is not a large difference between a good and a
bad wine with regard to Potassium sulphate concentration. The results show a good
wine has a mean Potassium sulphate concentration of ~0.75 g/dm³, while a bad wine
has a mean of ~0.55 g/dm³.


Independent Variables vs. Quality Rating: A Negative Linear Relationship

The visualization indicates that the greater the concentration of volatile acids
in a wine, the worse the quality rating. Wines rated good typically contain
<0.4 g/dm³ acetic acid.

The visualization indicates that all wine samples are relatively close to the
density of water, averaging around 0.9975 g/cm³. However, the wine samples with a
good quality rating have a slightly lower density, with a mean value of
~0.996 g/cm³.

The visualization shows mediocre and good wines have a mean pH value <= 3.25,
whereas wines with a bad quality rating have a higher mean pH value (>= 3.25).
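The group means read off the boxplots can also be computed directly. This is a minimal sketch using a small hypothetical data frame (wine_demo and its values are invented for illustration and are not taken from the dataset); with the real data, the same aggregate() call would be applied to the wine data frame.

```r
# Hypothetical miniature data frame standing in for the wine data
wine_demo <- data.frame(
  quality.rating   = factor(c("bad", "bad", "mediocre", "mediocre", "good", "good"),
                            levels = c("bad", "mediocre", "good")),
  alcohol          = c(9.5, 10.1, 10.0, 10.4, 11.2, 11.8),
  volatile.acidity = c(0.80, 0.70, 0.55, 0.50, 0.38, 0.36)
)

# Mean of each property within each quality rating
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.rating,
                         data = wine_demo, FUN = mean)
group_means
```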


pH and the Non-Volatile Acid Variables

The visualization represents the inverse relationship between pH and the weak
acids found in wine. Solutions with a pH below 7.0 are termed acidic and
solutions with a pH above 7.0 are termed basic. All of the red wine samples are
acidic solutions. As pH goes up, the wine becomes less acidic.


Density and the Non-Volatile Acid Variables

The visualization represents the effect acidity has on wine density. Dissolved
acids add mass to the solution, so as the acid concentration increases, the
density of the wine also increases.

Synopsis

The bivariate analysis depicts notable relationships between wine quality and
the physicochemical characteristics. The boxplots above show a positive
correlation between fixed acidity, citric acid levels and wine quality: the
higher the non-volatile acid level, the better the wine quality. Additionally,
because acetic acid produces a vinegar taste, a negative correlation can be found
between the volatile acidity variable and wine quality.

A good wine has the lowest density. Because density is positively correlated
with acidity, this is likely explained by the higher alcohol content of good
wine, since density also depends on the percent alcohol. It is also interesting
to point out that there seems to be a fine line between total acidity level and
pH value. For a wine to be considered good, it should have a low volatile
acidity level in conjunction with higher citric acid and fixed acid
concentrations, while its pH stays at or below ~3.3.


Multivariate Plots & Analysis

The visualization compares the free Sulfur dioxide and the total Sulfur dioxide
variables with the dependent quality rating variable. A strong positive
correlation exists between the two Sulfur dioxide variables. The majority of good
wine appears to have a free Sulfur dioxide concentration of <50 mg/dm³ and a
total Sulfur dioxide concentration of <100 mg/dm³.

The visualization compares the Fixed Acidity and pH variables with the dependent
quality rating variable. There is a strong inverse relationship between Fixed
Acidity and pH. This is to be expected: as pH rises, acidity decreases. Also of
note, the good quality rating has a wide, even distribution.

The visualization compares the Fixed Acidity and Citric Acid variables with the
wine quality rating. Results show a strong positive linear relationship between
the independent variables and wine quality. Interestingly, the majority of the
good quality data points cluster above a 0.25 g/dm³ citric acid value.

The visualization compares the Fixed Acidity and the Density variables with the
wine quality rating. Results indicate a strong positive correlation between the
two independent variables; however, there does not appear to be any correlation
between the independent and dependent variables. Interestingly, the majority of
the good quality data points are clustered at or below a Tartaric Acid value of
0.4 g/dm³.


Mathematical Model

The goal of the multiple linear regression model is to predict wine quality
based on the chemical properties of a wine sample.

# Multiple Linear Regression
dataset = read.csv('wineQualityReds.csv')
dataset = dataset[, 2:13]

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$quality, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)


# Note: feature scaling is not applied here; lm() does not require it

# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = quality ~ .,
               data = training_set)
summary(regressor)
## 
## Call:
## lm(formula = quality ~ ., data = training_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.66781 -0.36656 -0.06195  0.45616  1.96562 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.471e+01  2.370e+01   0.621 0.534945    
## fixed.acidity         2.265e-02  2.878e-02   0.787 0.431501    
## volatile.acidity     -9.534e-01  1.347e-01  -7.078 2.41e-12 ***
## citric.acid          -1.259e-01  1.619e-01  -0.778 0.436697    
## residual.sugar        1.043e-02  1.627e-02   0.641 0.521547    
## chlorides            -1.932e+00  4.586e-01  -4.213 2.70e-05 ***
## free.sulfur.dioxide   3.379e-03  2.487e-03   1.359 0.174485    
## total.sulfur.dioxide -3.005e-03  8.114e-04  -3.704 0.000222 ***
## density              -1.067e+01  2.418e+01  -0.441 0.659225    
## pH                   -4.486e-01  2.161e-01  -2.075 0.038143 *  
## sulphates             8.889e-01  1.311e-01   6.778 1.86e-11 ***
## alcohol               2.917e-01  2.975e-02   9.804  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6519 on 1266 degrees of freedom
## Multiple R-squared:  0.3519, Adjusted R-squared:  0.3462 
## F-statistic: 62.48 on 11 and 1266 DF,  p-value: < 2.2e-16

# Predicting the Test set results
# y_pred is a matrix with columns fit, lwr and upr (95% prediction interval)
y_pred = predict(regressor, newdata = test_set,
                 interval = "prediction", level = 0.95)

# smoothScatter() uses the first two columns of the matrix (fit and lwr)
smoothScatter(y_pred, pch = ".", cex = 5, col = "black",
              colramp = colorRampPalette(c("white", blues9)),
              xlab = "Fit",
              ylab = "Model Prediction",
              main = "Predicted Future Values")

The visualization represents the 95% prediction interval, with data points
representing the model's predicted values for the test set [10]. The fitted
values lie inside the interval bounds by construction, so the quality of the
predictions is better judged from the R-squared of 0.3519 reported in the model
summary.
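One way to check a prediction interval against held-out data is to count how many observed test values fall inside it. The sketch below does this on simulated data; the data frame, its coefficients and the 80/20 split are assumptions for illustration and are not the wine data.

```r
set.seed(123)
n <- 400

# Hypothetical data standing in for the wine training and test sets
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 5 + 0.3 * demo$x1 - 0.5 * demo$x2 + rnorm(n, sd = 0.65)

# 80/20 split into training and test rows
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

fit  <- lm(y ~ ., data = train)
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Share of observed test values that fall inside their prediction interval
coverage <- mean(test$y >= pred[, "lwr"] & test$y <= pred[, "upr"])

# Root mean squared error of the point predictions
rmse <- sqrt(mean((test$y - pred[, "fit"])^2))

c(coverage = coverage, rmse = rmse)
```

A coverage close to 0.95 suggests the interval is well calibrated, while the RMSE expresses the typical prediction error in quality points.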

# Plot a correlation matrix of the test-set variables
# install.packages('corrplot')
library(corrplot)
cor_matrix = cor(test_set[1:12])

par(mar = c(5, 4, 1.5, 2) + 0.1)  # margin padding
corrplot(cor_matrix, method = "circle", tl.cex = 0.6)
title(main = "Regression Model Correlation Matrix", cex.main = 1.3)

Volatile acidity, chlorides, total sulfur dioxide, sulphates and alcohol are
highly statistically significant predictors of the dependent variable
(p < 0.001), while pH has a weaker but still significant influence on wine
quality (p < 0.05). The model explains about 35% of the variance in quality
(R-squared = 0.3519); the next step is to optimize it with the Backward
Elimination method.

Model Optimization

# Building the optimal model using Backward Elimination
regressor = lm(formula = quality ~ fixed.acidity + 
                 volatile.acidity + 
                 citric.acid + 
                 residual.sugar + 
                 chlorides +
                 free.sulfur.dioxide + 
                 total.sulfur.dioxide + 
                 density + 
                 pH +
                 sulphates +
                 alcohol,
               data = dataset)  
summary(regressor)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60575 -0.35883 -0.04806  0.46079  1.95643 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2957316  0.3995603  10.751  < 2e-16 ***
## volatile.acidity     -1.0381945  0.1004270 -10.338  < 2e-16 ***
## chlorides            -2.0022839  0.3980757  -5.030 5.46e-07 ***
## total.sulfur.dioxide -0.0023721  0.0005064  -4.684 3.05e-06 ***
## pH                   -0.4351830  0.1160368  -3.750 0.000183 ***
## sulphates             0.8886802  0.1100419   8.076 1.31e-15 ***
## alcohol               0.2906738  0.0168108  17.291  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3548 
## F-statistic: 147.4 on 6 and 1592 DF,  p-value: < 2.2e-16
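The output above shows the full model and the final six-variable model, but not the elimination steps in between. The following is a minimal sketch of one way to automate backward elimination, assuming numeric predictors and a 0.05 significance threshold. The function name backward_eliminate and the simulated data are hypothetical and are not the code used to produce the results above.

```r
set.seed(42)
n <- 300

# Hypothetical data: y depends on x1 and x2 only; x3 and x4 are pure noise
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - 1.5 * demo$x2 + rnorm(n)

backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # p-values of the current predictors (the intercept row is not selected)
    p <- summary(fit)$coefficients[predictors, "Pr(>|t|)"]
    # Stop when every predictor is significant or only one remains
    if (max(p) <= alpha || length(predictors) == 1) return(fit)
    # Drop the single least significant predictor and refit
    predictors <- setdiff(predictors, names(which.max(p)))
  }
}

final_fit <- backward_eliminate(demo, "y")
names(coef(final_fit))
```

Removing one variable per iteration matters because p-values change each time the model is refitted, especially when predictors are correlated with each other.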

A multiple linear regression model was fitted to the dataset using the backward
elimination method. The findings indicate six independent variables have a high
statistical influence (p < 0.05) on the quality of a wine. A violin plot was used
to visualize the descriptive statistics of each influential variable.

Note that in some cases, as with total sulfur dioxide, chlorides, and pH, the
difference between a “good” and a “bad” wine is minute. However, the three
independent variables with the greatest statistical influence (alcohol, volatile
acidity and sulphates) do show a noticeable separation between the quality
ratings. The results indicate that a good wine will have a high alcohol
percentage, a low volatile acid concentration and a Potassium sulphate
concentration of ~0.75 g/dm³, in line with the group mean reported in the
bivariate analysis.

The visualization compares the two independent variables with the highest
statistical influence on wine quality, alcohol and volatile acidity. The results
show wine quality has a positive correlation with alcohol and a negative
correlation with volatile acidity, matching the signs of the regression
coefficients, with the strongest negative correlation seen in the 2, 4, 7 & 8
quality values.

Synopsis

The multivariate analysis reveals statistically significant relationships
between wine quality and six of the independent physicochemical properties. The
scatterplot visualizations indicate that “good” wine will have low concentrations
of total Sulfur dioxide and free Sulfur dioxide, with citric acid values mostly
above 0.25 g/dm³.

An optimized multiple linear regression model using the Backward Elimination
method found that alcohol, volatile acidity, sulphates, total sulfur dioxide,
chlorides and pH have a statistically significant influence on wine quality
(p < 0.001 for each).


Results

The univariate analysis revealed six independent physicochemical properties have
normal or close-to-normal distributions, while the remaining five properties
have right-skewed distributions, requiring a logarithmic function be used to
better understand the distributions.

The bivariate analysis revealed a positive linear relationship between the
independent physicochemical properties alcohol, fixed acidity, citric acid and
sulphates and the dependent variable. A negative linear relationship exists
between volatile acidity, density, pH and the quality rating. An inverse
relationship exists between pH and the acids: pH levels rise as acid levels
decrease. Additionally, there is a positive correlation between density and the
acids due to the chemical properties that exist between an acid molecule and the
surrounding substance. Strong correlations (|r| > 0.5) were discovered between
free Sulfur dioxide and total Sulfur dioxide, fixed acidity and density, fixed
acidity and pH, as well as fixed acidity and citric acid.

A multiple linear regression model containing 11 independent variables was
fitted to a training set of 1278 wine samples and used to predict wine quality
for a test set of 321 samples. The model is statistically significant at the 95%
confidence level, with a p-value <2.2e-16, a residual standard error of 0.6519 on
1266 degrees of freedom, and an F-statistic of 62.48 on 11 and 1266 DF. The 11
variables account for 35.19% of the variance in wine quality
(R-squared = 0.3519).

A second multiple linear regression model utilizing the Backward Elimination
method was fitted to the full dataset of 1599 wine samples to determine which
predictor variables have the strongest statistical relationship with the
dependent variable.

Optimized Multiple Linear Regression Summary:

The Backward Elimination method indicates the alcohol, volatile acidity,
sulphates, total sulfur dioxide, chlorides and pH physicochemical properties have
a statistically significant influence on wine quality at the 95% confidence
level, with a p-value <2.2e-16 and a residual standard error of 0.6487 on 1592
degrees of freedom. These six physicochemical properties account for 35.72% of
the variance of wine quality (R-squared = 0.3572, adjusted R-squared = 0.3548).
The high F-statistic of 147.4 and small p-value of < 2.2e-16 give sufficient
statistical evidence that the six independent variables predict the quality
rating of wine; therefore, the Null Hypothesis can be rejected.
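As a consistency check, the R-squared of a linear model can be recovered from its overall F-statistic using R² = F·k / (F·k + df), where k is the number of predictors and df is the residual degrees of freedom. The sketch below applies this identity to the values printed in the two summary() outputs; the helper name r2_from_f is hypothetical.

```r
# R-squared implied by an overall F-statistic with k predictors
# and df_resid residual degrees of freedom
r2_from_f <- function(f_stat, k, df_resid) {
  (f_stat * k) / (f_stat * k + df_resid)
}

# Values copied from the summary() outputs of the two models
r2_full      <- r2_from_f(62.48, 11, 1266)  # training-set model, 11 predictors
r2_optimized <- r2_from_f(147.4,  6, 1592)  # six-predictor model

round(c(full = r2_full, optimized = r2_optimized), 3)
```

Both values agree with the Multiple R-squared lines in the regression output (0.3519 and 0.3572) to three decimal places.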


Final Plots and Summary

Plot One

The quality variable was stored as an integer (‘int’), which is not well suited
to group comparisons, so it was transformed into a categorical variable. The
majority of the wine samples fall into the “mediocre” quality rating.

Plot Two

The alcohol and sulphate variables have a positive correlation with the quality
variable. Volatile acidity, chlorides and pH have a negative correlation with the
quality rating. Interestingly, total sulfur dioxide has a normal distribution
with the quality variable. Results show that a good wine will have a high
alcohol percentage, a low volatile acid concentration and a Potassium sulphate
concentration of ~0.75 g/dm³, in line with the group mean reported in the
bivariate analysis.

Plot Three

The results from the multiple linear regression model show that alcohol and
volatile acidity have the strongest statistical influence on wine quality. This
plot compares these two variables against wine quality. Wine quality has a
positive correlation with alcohol and a negative correlation with volatile
acidity, matching the signs of the regression coefficients.

Summary

The exploratory data analysis revealed the distributions of the 11 independent
variables, as well as the interactions the physicochemical properties have with
each other. The multivariate analysis focused on the independent variables with
strong correlations (|r| > 0.5); the results show that fixed acidity, with three
such relationships, is strongly correlated with more independent variables than
any other property.

The dependent wine quality variable has a roughly normal distribution, with most
samples having a quality score of 5 or 6. The alcohol content and volatile
acidity concentration have the strongest statistical influence on the dependent
variable, with sulphates, total sulfur dioxide, chlorides and pH also having an
influence on the quality of a wine. Interestingly, when the two most influential
variables, alcohol content and volatile acidity, are compared with the wine
quality rating variable, the results show good wine has an alcohol content >11.5%
by volume and an Acetic Acid concentration <0.5 g/dm³. Future work on this
dataset should include exploring the outliers in this analysis. Why is a wine
with a high alcohol percentage and a high Acetic Acid concentration still
considered a good wine? Is there a unique combination of physicochemical
properties within these samples that leads to these abnormal quality ratings?

This analysis used a multiple linear regression model to account for 35.72% of
the variance of wine quality (R-squared = 0.3572 for the six-variable model). To
improve the predictive power of the model, additional data with a wider spread of
quality scores should be used. Moreover, additional predictive models should be
employed, such as Support Vector Machine (SVM), Decision Tree Regression or
K-Nearest Neighbors (KNN), to provide more accurate predictions for a wine’s
quality as a function of the independent physicochemical properties.

The multiple linear regression model determined that the independent
physicochemical properties with the highest statistical influence on wine
quality are alcohol percentage, volatile acidity, sulphates, total sulfur
dioxide, chlorides and pH. Sulphates are added to wine and act as an
antimicrobial and antioxidant; in this dataset, good wines have a mean Potassium
sulphate concentration of ~0.75 g/dm³, as reported in the bivariate analysis.
Furthermore, good wines have low chloride and total Sulfur dioxide
concentrations and a lower pH. The independent variables that have the maximum
statistical influence on wine quality are volatile acidity and alcohol
percentage. Therefore, good wines will have a high alcohol percentage and a low
concentration of volatile acids, which give wines an unpleasant, vinegar taste.
This analysis shows that alcohol and volatile acidity are the properties most
strongly related to quality, demonstrating the importance of a wine being free
of imperfections.


Reflection

I chose the Red Wine Quality dataset to get a better understanding of how the
physicochemical compounds found in wine affect a wine’s quality. Home brewing
wine is on my bucket list and one of the goals of this data analysis was to
learn what makes a quality wine so that I may implement my findings in the
future. I consider this a successful data analysis because I was able to explore
the independent variables and compare them with the dependent quality variable
and gain a clear understanding of the influences that affect wine quality.

The challenge I experienced with this analysis involved implementing the
mathematical algorithm correctly. I initially wanted to create three different
models, a Support Vector Machine (SVM), a Decision Tree Regression and a
K-Nearest Neighbors (KNN) model, and compare the results to provide more accurate
predictions for a wine’s quality as a function of the independent
physicochemical properties. I’m not as familiar with R as I am with Python, and
learning the syntax for the machine learning algorithms took too much time, so I
decided to simplify matters and fit a multiple linear regression model. The
linear regression model did very well and I’m proud of how well it performed;
the next step in the project would be to implement the SVM, KNN and Decision
Tree Regression algorithms for a more robust machine learning analysis.

It was exciting to investigate the independent variables for relationships
affecting wine quality. I learned how to make beautiful multivariate graphs and
I created my first multiple linear regression model using R. Overall, I’m
extremely proud of the work I have accomplished with this project.


References

1.  Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis,
    Modeling wine preferences by data mining from physicochemical properties,
    In Decision Support Systems, Volume 47, Issue 4, 2009, Pages 547-553, ISSN
    0167-9236, https://doi.org/10.1016/j.dss.2009.05.016.

    (http://www.sciencedirect.com/science/article/pii/S0167923609001377)

2.  Dataset link: http://www3.dsi.uminho.pt/pcortez/dss09.bib

3.  http://r4stats.com/examples/graphics-ggplot2/

4.  http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs

5.  https://www.r-bloggers.com/multiple-regression-lines-in-ggpairs/

6.  https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html

7.  https://www.r-bloggers.com/from-continuous-to-categorical/

8.  http://www.shonscience.com/unit-1-earth-as-a-system2/does-the-shape-size-or-temperature-of-matter-affect-its-density

9.  https://machinelearningmastery.com/pre-process-your-dataset-in-r/

10.  http://www.stat.columbia.edu/~martin/W2024/R6.pdf

11.  http://data.library.virginia.edu/diagnostic-plots/

12.  https://www.stat.berkeley.edu/classes/s133/Lr.html

13.  http://www.public.iastate.edu/~maitra/stat501/lectures/Outliers.pdf